Precise Zero-Shot Dense Retrieval without Relevance Labels
it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance label is available. (Abstract)
#HyDE (Hypothetical Document Embeddings) Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g. InstructGPT) to generate a hypothetical document.
Then, an unsupervised contrastively learned encoder (e.g. Contriever) encodes the document into an embedding vector.
Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers, across various tasks (e.g. web search, QA, fact verification) and languages (e.g. sw, ko, ja).
ja dataset?
https://github.com/texttron/hyde/blob/main/approach.png?raw=true
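A minimal sketch of the two steps quoted above (generate a hypothetical document, then encode it), to make the data flow concrete. Everything here is an assumption for illustration: `generate_hypothetical_document` is a hypothetical stub that returns a canned answer instead of calling an instruction-following LM, `encode` is a toy bag-of-words vector rather than Contriever, and the final nearest-neighbor ranking by cosine similarity is the usual dense-retrieval step, not something taken from the lines above.

```python
import math
import re
from collections import Counter


def generate_hypothetical_document(query: str) -> str:
    """Hypothetical stand-in for the instruction-following LM.

    A real system would prompt a model such as InstructGPT; a canned
    answer is returned here so the sketch runs offline.
    """
    canned = {
        "how long do wisdom teeth take to heal": (
            "Wisdom tooth extraction recovery usually takes up to two weeks; "
            "swelling fades after a few days."
        ),
    }
    # Fall back to the raw query when no canned document exists.
    return canned.get(query.strip().lower(), query)


def encode(text: str) -> Counter:
    """Toy bag-of-words vector; a placeholder, NOT Contriever."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Rank corpus documents against the hypothetical document's vector."""
    hypothetical = generate_hypothetical_document(query)  # step 1: generate
    query_vector = encode(hypothetical)  # step 2: encode the document, not the query
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query_vector, encode(doc)),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus (made up for this sketch).
corpus = [
    "Recovery after wisdom tooth extraction usually takes two weeks.",
    "The Eiffel Tower is in Paris.",
    "Teeth whitening kits take an hour to use.",
]
results = hyde_search("How long do wisdom teeth take to heal", corpus, k=1)
print(results)
```

The point of the sketch is that the search vector comes from the generated document rather than from the query itself, so document-to-document similarity does the matching and no relevance labels are needed.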
4.1 Setup (4 Experiments)
Datasets
web search query sets (-> 4.2, Table 1)
diverse collection of 6 low-resource datasets (-> 4.3, Table 2)